Group observations into sets and summarise on those sets.

group_by : group observations

You can group your data given a set of variables. For example we can group per age and gender:

pulse %>% group_by( age,gender )
# A tibble: 110 × 13
# Groups:   age, gender [21]
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1993_A Bonn…    173     57    18 female no     yes     modera… sat       86     88
 2 1993_B Mela…    179     58    19 female no     yes     modera… ran       82    150
 3 1993_C Cons…    167     62    18 female no     yes     high    ran       96    176
 4 1993_D Trav…    195     84    18 male   no     yes     high    sat       71     73
 5 1993_E Lauri    173     64    18 female no     yes     low     sat       90     88
 6 1993_F Geor…    184     74    22 male   no     yes     low     ran       78    141
 7 1993_G Cher…    162     57    20 female no     yes     modera… sat       68     72
 8 1993_H Fran…    169     55    18 female no     yes     modera… sat       71     77
 9 1993_I Sonja    164     56    19 female no     yes     high    sat       68     68
10 1993_J Troy     168     60    23 male   no     yes     modera… ran       88    150
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

Note the tag Groups: age, gender [21] in the output, meaning that group_by found 21 {age,gender} groups in our dataset.

Once groups are marked with group_by, then we can analyse those groups with summarise producing a single row output per group. For example we can count the number of observation per group using function n():

pulse %>% group_by( age, gender ) %>% 
          summarise( n = n())
# A tibble: 21 × 3
# Groups:   age [13]
     age gender     n
   <dbl> <chr>  <int>
 1    18 female    15
 2    18 male       8
 3    19 female    14
 4    19 male      15
 5    20 female    11
 6    20 male      16
 7    21 female     3
 8    21 male       7
 9    22 female     1
10    22 male       5
# … with 11 more rows

⚠️You may get a warning like ‘summarise() regrouping … (override with .groups argument)’. It is a reminder to remove the group from the result set. You may ignore this or set the ‘summarise’ argument .groups=‘drop’, seel also ?summarise.

pulse %>% group_by( age, gender ) %>% 
          summarise( n = n(), .groups='drop')
# A tibble: 21 × 3
     age gender     n
   <dbl> <chr>  <int>
 1    18 female    15
 2    18 male       8
 3    19 female    14
 4    19 male      15
 5    20 female    11
 6    20 male      16
 7    21 female     3
 8    21 male       7
 9    22 female     1
10    22 male       5
# … with 11 more rows

Which function produces the same output as above given {age,gender}?

‘count’ function:

puls %>% count(age,gnder)


But of course we want to do more than just count the size of the groups. We can for example calculate the mean height and weight per {age,gender} group:

pulse %>% group_by(age,gender) %>% 
          summarise(size=n(),meanHeight=mean(height), meanWeight=mean(weight))
# A tibble: 21 × 5
# Groups:   age [13]
     age gender  size meanHeight meanWeight
   <dbl> <chr>  <int>      <dbl>      <dbl>
 1    18 female    15       168.       58.9
 2    18 male       8       183.       74.4
 3    19 female    14       160.       52  
 4    19 male      15       173.       72.8
 5    20 female    11       166.       58.0
 6    20 male      16       178.       74.5
 7    21 female     3       172        57.3
 8    21 male       7       180.       76  
 9    22 female     1       151        42  
10    22 male       5       178.       76.4
# … with 11 more rows

Queries

We now have all the tools we need to apply more complex queries on our data. For example, group per gender on those that ran and summarize on mean age, pulse1 and pulse2. First we need to filter only those who ran (see explanation on ran in pulse) and only then group and summarise:

pulse %>% filter(ran == "ran") %>% 
          group_by( gender ) %>% 
          summarise( size = n(), meanAge= mean(age), meanPluse1 = mean(pulse1), meanPulse2 = mean( pulse2 ) )
# A tibble: 2 × 5
  gender  size meanAge meanPluse1 meanPulse2
  <chr>  <int>   <dbl>      <dbl>      <dbl>
1 female    22    20.6       75.5       126.
2 male      24    19.8       75.5       128.

now for those who sat:

pulse %>% filter(ran == "sat") %>% group_by( gender ) %>% 
    summarise( size = n(), meanAge= mean(age), meanPluse1 = mean(pulse1), meanPulse2 = mean( pulse2 ) )
# A tibble: 2 × 5
  gender  size meanAge meanPluse1 meanPulse2
  <chr>  <int>   <dbl>      <dbl>      <dbl>
1 female    29    19.6       NA         NA  
2 male      35    21.9       73.3       72.3

Note that there are missing values, account for it by using na.rm=TRUE:

pulse %>% filter(ran == "sat") %>% 
          group_by( gender ) %>% 
          summarise( count = n(), meanAge= mean(age), 
               meanPluse1 = mean(pulse1,na.rm=TRUE), meanPulse2 = mean( pulse2 , na.rm=TRUE) )
# A tibble: 2 × 5
  gender count meanAge meanPluse1 meanPulse2
  <chr>  <int>   <dbl>      <dbl>      <dbl>
1 female    29    19.6       79.1       78  
2 male      35    21.9       73.3       72.3

⚠️ It is good practice to check your results. For example, the group sizes in the original survey table who ran and sat must match the sum of sizes in the different summaries shown above under ‘size’ column. For ran==“ran” summary we have 22+24=44 and for ran==“sat” summary we have 29+35=64. We can check them against totals below and we see that they do:

pulse %>%  count(ran) 
# A tibble: 2 × 2
  ran       n
  <chr> <int>
1 ran      46
2 sat      64

ungroup : remove grouping

To remove the grouping use we have ungroup function:

pulse %>% group_by(age,gender) %>% ungroup() # results in the original pulse tibble
# A tibble: 110 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1993_A Bonn…    173     57    18 female no     yes     modera… sat       86     88
 2 1993_B Mela…    179     58    19 female no     yes     modera… ran       82    150
 3 1993_C Cons…    167     62    18 female no     yes     high    ran       96    176
 4 1993_D Trav…    195     84    18 male   no     yes     high    sat       71     73
 5 1993_E Lauri    173     64    18 female no     yes     low     sat       90     88
 6 1993_F Geor…    184     74    22 male   no     yes     low     ran       78    141
 7 1993_G Cher…    162     57    20 female no     yes     modera… sat       68     72
 8 1993_H Fran…    169     55    18 female no     yes     modera… sat       71     77
 9 1993_I Sonja    164     56    19 female no     yes     high    sat       68     68
10 1993_J Troy     168     60    23 male   no     yes     modera… ran       88    150
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise


Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC